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Abstract 

We investigate a local reparameterizaton technique for greatly reducing the vari¬ 
ance of stochastic gradients for variational Bayesian inference (SGVB) of a pos¬ 
terior over model parameters, while retaining parallelizability. This local repa¬ 
rameterization translates uncertainty about global parameters into local noise that 
is independent across datapoints in the minibatch. Such parameterizations can be 
trivially parallelized and have variance that is inversely proportional to the mini¬ 
batch size, generally leading to much faster convergence. Additionally, we explore 
a connection with dropout: Gaussian dropout objectives correspond to SGVB with 
local reparameterization, a scale-invariant prior and proportionally fixed posterior 
variance. Our method allows inference of more flexibly parameterized posteriors; 
specifically, we propose variational dropout, a generalization of Gaussian dropout 
where the dropout rates are learned, often leading to better models. The method 
is demonstrated through several experiments. 


1 Introduction 

Deep neural networks are a flexible family of models that easily scale to millions of parameters and 
datapoints, but are still tractable to optimize using minibatch-based stochastic gradient ascent. Due 
to their high flexibility, neural networks have the capacity to fit a wide diversity of nonlinear patterns 
in the data. This flexbility often leads to overfitting when left unchecked: spurious patterns are found 
that happen to fit well to the training data, but are not predictive for new data. Various regularization 
techniques for controlling this overfitting are used in practice; a currently popular and empirically 
effective technique being dropout In | |22| it was shown that regular (binary) dropout has a 
Gaussian approximation called Gaussian dropout with virtually identical regularization performance 
but much faster convergence. In section 5 of ll22l it is shown that Gaussian dropout optimizes a lower 
bound on the marginal likelihood of the data. In this paper we show that a relationship between 
dropout and Bayesian inference can be extended and exploited to greatly improve the efficiency of 
variational Bayesian inference on the model parameters. This work has a direct interpretation as a 
generalization of Gaussian dropout, with the same fast convergence but now with the freedom to 
specify more flexibly parameterized posterior distributions. 

Bayesian posterior inference over the neural network parameters is a theoretically attractive method 
for controlling overfitting; exact inference is computationally intractable, but efficient approximate 
schemes can be designed. Markov Chain Monte Carlo (MCMC) is a class of approximate inference 
methods with asymptotic guarantees, pioneered by lH6l for the application of regularizing neural 
networks. Later useful refinements include ||23l and fin. 

An alternative to MCMC is variational inference CD or the equivalent minimum description length 
(MDL) framework. Modern variants of stochastic variational inference have been applied to neural 
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networks with some succes HI, but have been limited by high variance in the gradients. Despite 
their theoretical attractiveness, Bayesian methods for inferring a posterior distribution over neural 
network weights have not yet been shown to outperform simpler methods such as dropout. Even a 
new crop of efficient variational inference algorithms based on stochastic gradients with minibatches 
of data GSH3ED have not yet been shown to significantly improve upon simpler dropout-based 
regularization. 

In section[2]we explore an as yet unexploited trick for improving the efficiency of stochastic gradient- 
based variational inference with minibatches of data, by translating uncertainty about global param¬ 
eters into local noise that is independent across datapoints in the minibatch. The resulting method 
has an optimization speed on the same level as fast dropout (2 2), and indeed has the original Gaus¬ 
sian dropout method as a special case. An advantage of our method is that it allows for full Bayesian 
analysis of the model, and that it’s significantly more flexible than standard dropout. The approach 
presented here is closely related to several popular methods in the literature that regularize by adding 
random noise; these relationships are discussed in section[4] 

2 Efficient and Practical Bayesian Inference 

We consider Bayesian analysis of a dataset V, containing a set of N i.i.d. observations of tuples 
(x, y), where the goal is to learn a model with parameters or weights w of the conditional probabil¬ 
ity p( y| x > w ) (standard classification or regression^] Bayesian inference in such a model consists 
of updating some initial belief over parameters w in the form of a prior distribution p( w), after 
observing data V , into an updated belief over these parameters in the form of (an approximation 
to) the posterior distribution p(-w\T>). Computing the true posterior distribution through Bayes’ rule 
p( w|2?) = p(yt)p(V\w)/p(V) involves computationally intractable integrals, so good approxima¬ 
tions are necessary. In variational inference , inference is cast as an optimization problem where we 
optimize the parameters <f> of some parameterized model (j 0 (w) such that q t j > ( w) is a close approx¬ 
imation to p{ w|2?) as measured by the Kullback-Leibler divergence DKL(q<t>(w)\ \pfw\V)). This 
divergence of our posterior q c j > ( w) to the true posterior is minimized in practice by maximizing the 
so-called variational lower bound £(</>) of the marginal likelihood of the data: 

£(4>) = --Djcl(^(w)||p(w)) + L v {<j>) (1) 

where L D (<f>) = ^ E 90(w) [logp(y|x, w)] (2) 

(x.y)G v 

We’ll call Lxi((f>) the expected log-likelihood. The bound plus -DirL(< 7 g>(w)||p(w| 2 ?)) equals 
the (conditional) marginal log-likelihood Yl( x y ) g £> log p(y |x). Since this marginal log-likelihood 
is constant w.r.t. tf>, maximizing the bound w.r.t. <j> will minimize £)/fi(q , 0 (w)||p(w|P)). 

2.1 Stochastic Gradient Variational Bayes (SGVB) 

Various algorithms for gradient-based optimization of the variational bound (eq. U}) with differ¬ 
entiable q and p exist. See section [4] for an overview. A recently proposed efficient method for 
minibatch-based optimization with differentiable models is the stochastic gradient variational Bayes 
(SGVB) method introduced in lfl4ll (especially appendix F) and fI71 . The basic trick in SGVB is 
to parameterize the random parameters w ~ g^(w) as: w = where /(.) is a differen¬ 

tiable function and e ~ p(e) is a random noise variable. In this new parameterisation, an unbiased 
differentiable minibatch-based Monte Carlo estimator of the expected log-likelihood can be formed: 

AT f'f 

Lvif) ~ £i, GVB 0) = — ^logp(y®|x®,w = /(e,(/>)), (3) 

i =1 

where (x®, y ®)(^ 1 is a minibatch of data with M random datapoints (x®, y®) ~ V, and e is a noise 
vector drawn from the noise distribution p(e). We’ll assume that the remaining term in the varia¬ 
tional lower bound, £>ifL( 5 , <^(w)||p(w)), can be computed deterministically, but otherwise it may 
be approximated similarly. The estimator Q is differentiable w.r.t. <[> and unbiased, so its gradient 

'Note that the described method is not limited to classification or regression and is straightforward to apply 
to other modeling settings like unsupervised models and temporal models. 
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is also unbiased: V ^,Lx>((j>) — ^L ^ VB We can proceed with variational Bayesian inference 
by randomly initializing cp and performing stochastic gradient ascent on £(<P) 0 

2.2 Variance of the SGVB estimator 

The theory of stochastic approximation tells us that stochastic gradient ascent using ([3]) will asymp¬ 
totically converge to a local optimum for an appropriately declining step size and sufficient weight 
updates lfl 8 l . but in practice the performance of stochastic gradient ascent crucially depends on 
the variance of the gradients. If this variance is too large, stochastic gradient descent will fail 
to make much progress in any reasonable amount of time. Our objective function consists of an 
expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term 
-Dk\l(< 70 (w)||p(w)) that we assume can be calculated analytically and otherwise be approximated 
with Monte Carlo with similar reparameterization. 

Assume that we draw minibatches of datapoints with replacement; see appendix [F] for a similar 
analysis for minibatches without replacement. Using Li as shorthand for logp(y* |x*, w = f(e t , (f>)), 
the contribution to the likelihood for the z-th datapoint in the minibatch, the Monte Carlo estimator 
0 may be rewritten as Tff VB (</>) = jj Y^fLi Li, whose variance is given by 

2 M M M 

Var [L 5 ™^)] ~ ( £ Var [L<] + 2 £ £ Cov [L u Lj\ ) (4) 

i =1 i =1 j=i +1 

=iV 2 (^Var [Li] + ^^Cov [U, Lj] ), (5) 

where the variances and covariances are w.r.t. both the data distribution and e distribution, i.e. 
Var [Li] = Var e x i )y i [logp(y i |x®, w = /(e, </>))], with x/y* drawn from the empirical distribu¬ 
tion defined by the training set. As can be seen from 0, the total contribution to the variance by 
Var [Li] is inversely proportional to the minibatch size M. However, the total contribution by the 
covariances does not decrease with M. In practice, this means that the variance of /,|j iVB (c)) can be 
dominated by the covariances for even moderately large M. 

2.3 Local Reparameterization Trick 

We therefore propose an alternative estimator for which we have Cov [Li, Lj] = 0, so that the vari¬ 
ance of our stochastic gradients scales as 1/M. We then make this new estimator computationally 
efficient by not sampling e directly, but only sampling the intermediate variables /(e) through which 
e influences L^ VB ((f>). By doing so, the global uncertainty in the weights is translated into a form 
of local uncertainty that is independent across examples and easier to sample. We refer to such a 
reparameterization from global noise to local noise as the local reparameterization trick. Whenever 
a source of global noise can be translated to local noise in the intermediate states of computation 
(e —>• /(e)), a local reparameterization can be applied to yield a computationally and statistically 
efficient gradient estimator. 

Such local reparameterization applies to a fairly large family of models, but is best explained through 
a simple example: Consider a standard fully connected neural network containing a hidden layer 
consisting of 1000 neurons. This layer receives an M x 1000 input feature matrix A from the layer 
below, which is multiplied by a 1000 x 1000 weight matrix W, before a nonlinearity is applied, 
i.e. B = AW. We then specify the posterior approximation on the weights to be a fully factor¬ 
ized Gaussian, i.e. q^(wij) = N(pi.j, of •) Vwtjj € W, which means the weights are sampled as 
Wij = pij + cri,je.i,j, with j ~ N( 0,1). In this case we could make sure that Cov [L,, L./ = 0 
by sampling a separate weight matrix W for each example in the minibatch, but this is not com¬ 
putationally efficient: we would need to sample M million random numbers for just a single layer 
of the neural network. Even if this could be done efficiently, the computation following this step 
would become much harder: Where we originally performed a simple matrix-matrix product of the 
form B = AW, this now turns into M separate local vector-matrix products. The theoretical com¬ 
plexity of this computation is higher, but, more importantly, such a computation can usually not be 
performed in parallel using fast device-optimized BLAS (Basic Linear Algebra Subprograms). This 
also happens with other neural network architectures such as convolutional neural networks, where 
optimized libraries for convolution cannot deal with separate filter matrices per example. 
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Fortunately, the weights (and therefore e) only influence the expected log likelihood through the 
neuron activations B, which are of much lower dimension. If we can therefore sample the random 
activations B directly, without sampling W or e, we may obtain an efficient Monte Carlo estimator 
at a much lower cost. For a factorized Gaussian posterior on the weights, the posterior for the 
activations (conditional on the input A) is also factorized Gaussian: 

= i ^i,j ) ^ r = iV('y m j , S m j ), with 

1000 1000 

1m, j = 'y ' a m,i^i,j i $ m,j = y ' ^m,i^i,j' ( 6 ) 

j=l i=l 

Rather than sampling the Gaussian weights and then computing the resulting activations, we may 
thus sample the activations from their implied Gaussian distribution directly, using b m j = j + 
with Cmj ~ iV(0,1). Here, £ is an M x 1000 matrix, so we only need to sample M 
thousand random variables instead of M million: a thousand fold savings. 


In addition to yielding a gradient estimator that is more computationally efficient than drawing sep¬ 
arate weight matrices for each training example, the local reparameterization trick also leads to an 
estimator that has lower variance. To see why, consider the stochastic gradient estimate with respect 
to the posterior parameter of j for a minibatch of size M = 1. Drawing random weights W, we get 

p)T SGVB f)T SGVB , n 

UJ - J D _ U±J T> c z, < 7 u '?n,i 

dafj db m>j 2(7; j 

If, on the other hand, we form the same gradient using the local reparameterization trick, we get 

7) T SGVB or SGVB f .„2 

ulj 'D _ alj T> /m 

d(J l 3 ~ db mJ 2^/Sffj' 

Here, there are two stochastic terms: The first is the backpropagated gradient dL^ WB /db m j, and 
the second is the sampled random noise (e; ; - or ( m ,j)- Estimating the gradient with respect to a( y 
then basically comes down to estimating the covariance between these two terms. This is much 
easier to do for Q r , :1 as there are much fewer of these: individually they have higher correlation 
with the backpropagated gradient so the covariance is easier to estimate. In other 

words, measuring the effect of on dL^f VB /db m j is easy as is the only random variable 
directly influencing this gradient via b m j. On the other hand, when sampling random weights, 
there are a thousand e;j influencing each gradient term, so their individual effects get lost in the 
noise. In appendix[D]we make this argument more rigorous, and in section[5]we show that it holds 
experimentally. 


3 Variational Dropout 

Dropout is a technique for regularization of neural network parameters, which works by adding 
multiplicative noise to the input of each layer of the neural network during optimization. Using the 
notation of section [23] for a fully connected neural network dropout corresponds to: 

B = (A ° £)9, with ~ (9) 

where A is the M x K matrix of input features for the current minibatch, 9 is a K x L weight ma¬ 
trix, and B is the M x L output matrix for the current layer (before a nonlinearity is applied). The 
o symbol denotes the elementwise (Hadamard) product of the input matrix with a M x I\ matrix 
of independent noise variables f. By adding noise to the input during training, the weight parame¬ 
ters 9 are less likely to overfit to the training data, as shown empirically by previous publications. 
Originally, Go) proposed drawing the elements of £ from a Bernoulli distribution with probability 
1 — p, with p the dropout rate. Later it was shown that using a continuous distribution with the same 
relative mean and variance, such as a Gaussian A(l, a) with a = p/(\—p), works as well or better 

GO). 

Here, we re-interpret dropout with continuous noise as a variational method, and propose a gen¬ 
eralization that we call variational dropout. In developing variational dropout we provide a firm 
Bayesian justification for dropout training by deriving its implicit prior distribution and variational 
objective. This new interpretation allows us to propose several useful extensions to dropout, such as 
a principled way of making the normally fixed dropout rates p adaptive to the data. 
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3.1 Variational dropout with independent weight noise 


If the elements of the noise matrix £ are drawn independently from a Gaussian N( 1 , a), the marginal 
distributions of the activations b rrl ] G B are Gaussian as well: 

K K 

Q(t> ( b rn .j | A) — ) ^m,j ) > with r 'frn.j — ^ : and — Oi ^ ^ j • (10) 

i—1 i= 1 

Making use of this fact, l22ll proposed Gaussian dropout , a regularization method where, instead 
of applying |9]l, the activations are directly drawn from their (approximate or exact) marginal distri¬ 
butions as given by m- eh argued that these marginal distributions are exact for Gaussian noise 
£, and for Bernoulli noise still approximately Gaussian because of the central limit theorem. This 
ignores the dependencies between the different elements of B, as present using but l22l report 
good results nonetheless. 


As noted by l22l . and explained in appendix[B] this Gaussian dropout noise can also be interpreted 
as arising from a Bayesian treatment of a neural network with weights W that multiply the input to 
give B = AW, where the posterior distribution of the weights is given by a factorized Gaussian with 
qadf j). From this perspective, the marginal distributions ( fT0| th en arise through 


the application of the local reparameterization trick, as introduced in section 2.3 The variational 


objective corresponding to this interpretation is discussed in section 3.3 


3.2 Variational dropout with correlated weight noise 

Instead of ignoring the dependencies of the activation noise, as in section M we may retain the 
dependencies by interpreting dropout <[9]) as a form of correlated weight noise: 

B = (A o £)0, £ i>:i ~ N( 1, a) <=4- b m = a m W, with 

W = (wj, w ' 2 ,..., w' K )', and w* = srfi, with q l p{s i ) = N(l,a), (11) 

where a™ is a row of the input matrix and b m a row of the output. The w, are the rows of the 
weight matrix, each of which is constructed by multiplying a non-stochastic parameter vector 9i by 
a stochastic scale variable .s,;. The distribution on these scale variables we interpret as a Bayesian 
posterior distribution. The weight parameters 9i (and the biases) are estimated using maximum 
likelihood. The original Gaussian dropout sampling procedure (|9} can then be interpreted as arising 
from a local reparameterization of our posterior on the weights W. 


3.3 Dropout’s scale-invariant prior and variational objective 


The posterior distributions W) proposed in sections 3.1 and 3.2 have in common that they can 


be decomposed into a parameter vector 9 that captures the mean, and a multiplicative noise term 
determined by parameters a. Any posterior distribution on W for which the noise enters this mul¬ 
tiplicative way, we will call a dropout posterior. Note that many common distributions, such as 
univariate Gaussians (with nonzero mean), can be reparameterized to meet this requirement. 


During dropout training, 9 is adapted to maximize the expected log likelihood E 9o \L-d {())]■ For this 
to be consistent with the optimization of a variational lower bound of the form in |2|, the prior on 
the weights p( w) has to be such that f?A’L(q i 0 (w)||p(w)) does not depend on 9. In appendix [c] we 
show that the only prior that meets this requirement is the scale invariant log-uniform prior: 

p(k>g(K,j|)) oc c, 

i.e. a prior that is uniform on the log-scale of the weights (or the weight-scales s,; for section [T2| i. As 
explained in appendix [A] this prior has an interesting connection with the floating point format for 
storing numbers: From an MDL perspective, the floating point format is optimal for communicating 
numbers drawn from this prior. Conversely, the KL divergence DkL( q</>(^)\\p{y^)) with this prior 
has a natural interpretation as regularizing the number of significant digits our posterior q, ri stores 
for the weights Wij in the floating-point format. 


Putting the expected log likelihood and KL-divergence penalty together, we see that dropout training 
maximizes the following variatonal lower bound w.r.t. 9: 

[ l t>{0)\ - D KL (q a (w)\\p(Mv)), (12) 
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where we have made the dependence on the 6 and a parameters explicit. The noise parameters a 
(e.g. the dropout rates) are commonly treated as hyperparameters that are kept fixed during training. 
For the log-uniform prior this then corresponds to a fixed limit on the number of significant digits 
we can learn for each of the weights In section 3.4 we discuss the possibility of making this 


limit adaptive by also maximizing the lower bound with respect to a. 

For the choice of a factorized Gaussian approximate posterior with = Af(0ij,a6?j), as 

discussed in section 3.1 the lower bound ( fj~2| i is analyzed in detail in appendix [C] There, it is shown 
that for this particular choice of posterior the negative KL-divergence — -DjrLFZa (w)||p(w)) is n °t 
analytically tractable, but can be approximated extremely accurately using 


-~D KL [q ( / ) (wi)\p(w i )] ~ constant + 0.51og(a) + C\a + c 2 cr 2 + c 3 a 3 


with 


ci = 1.16145124, c 2 = -1.50204118, c 3 = 0.58629921. 


The same expression may be used to calculate the corresponding term — D^-£(g Q (s)||p(s)) for the 


posterior approximation of section 3.2 


3.4 Adaptive regularization through optimizing the dropout rate 

The noise parameters a used in dropout training (e.g. the dropout rates) are usually treated as fixed 
hyperparameters, but now that we have derived dropout’s variational objective making these 
parameters adaptive is trivial: simply maximize the variational lower bound with respect to a. We 
can use this to learn a separate dropout rate per layer, per neuron, of even per separate weight. In 
section[5]we look at the predictive performance obtained by making a adaptive. 

We found that very large values of a correspond to local optima from which it is hard to escape due 
to large-variance gradients. To avoid such local optima, we found it beneficial to set a constraint 
a < 1 during training, i.e. we maximize the posterior variance at the square of the posterior mean, 
which corresponds to a dropout rate of 0.5. 


4 Related Work 

Pioneering work in practical variational inference for neural networks was done in 0, where a 
(biased) variational lower bound estimator was introduced with good results on recurrent neural net¬ 
work models. In later work used it was shown that even more practical estimators can be formed 
for most types of continuous latent variables or parameters using a (non-local) reparameterization 
trick, leading to efficient and unbiased stochastic gradient-based variational inference. These works 
focused on an application to latent-variable inference; extensive empirical results on inference of 
global model parameters were reported in 0, including succesful application to reinforcement 
learning. These earlier works used the relatively high-variance estimator 0, upon which we im¬ 
prove. Variable reparameterizations have a long history in the statistics literature, but have only 
recently found use for efficient gradient-based machine learning and inference HI fl3l [191 . Related 
is also probabilistic backpropagation 0, an algorithm for inferring marginal posterior probabilities; 
however, it requires certain tractabilities in the network making it insuitable for the type of models 
under consideration in this paper. 

As we show here, regularization by dropout lf20l [22 ] can be interpreted as variational inference. 
DropConnect |2T1 is similar to dropout, but with binary noise on the weights rather than hidden units. 
DropConnect thus has a similar interpretation as variational inference, with a uniform prior over the 
weights, and a mixture of two Dirac peaks as posterior. In (2), standout was introduced, a variation 
of dropout where a binary belief network is learned for producing dropout rates. Recently, EH 
proposed another Bayesian perspective on dropout. In recent work 0, a similar reparameterization 
is described and used for variational inference; their focus is on closed-form approximations of the 
variational bound, rather than unbiased Monte Carlo estimators. El and 0 also investigate a 
Bayesian perspective on dropout, but focus on the binary variant. 0 reports various encouraging 
results on the utility of dropout’s implied prediction uncertainty. 
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5 Experiments 


We compare our method to standard binary dropout and two popular versions of Gaussian dropout, 
which we’ll denote with type A and type B. With Gaussian dropout type A we denote the pre-linear 
Gaussian dropout from l20l ; type B denotes the post-linear Gaussian dropout from l22l . This way, 
the method names correspond to the matrix names in section 2 (A or B) where noise is injected. 
Models were implemented in Theano |5), and optimization was performed using Adam [fT2j with 
default hyper-parameters and temporal averaging. 


Two types of variational dropout were included. Type A is correlated weight noise as introduced 
an adaptive version of Gaussian dropout type A. Variational dropout type B has 


in section 


3.2 


independent weight uncertainty as introduced in section [XT] and corresponds to Gaussian dropout 
type B. 


A de facto standard benchmark for regularization methods is the task of MNIST hand-written digit 
classification. We choose the same architecture as l20l : a fully connected neural network with 3 
hidden layers and rectified linear units (ReLUs). We follow the dropout hyper-parameter recom¬ 
mendations from these earlier publications, which is a dropout rate of p = 0.5 for the hidden layers 
and p = 0.2 for the input layer. We used early stopping with all methods, where the amount of 
epochs to run was determined based on performance on a validation set. 


Variance. We start out by empirically comparing the variance of the different available stochastic 
estimators of the gradient of our variational objective. To do this we train the neural network de¬ 
scribed above for either 10 epochs (test error 3%) or 100 epochs (test error 1.3%), using variational 
dropout with independent weight noise. After training, we calculate the gradients for the weights of 
the top and bottom level of our network on the full training set, and compare against the gradient 
estimates per batch of M = 1000 training examples. Appendix[E]contains the same analysis for the 
case of variational dropout with correlated weight noise. 

Table[l]shows that the local reparameterization trick yields the lowest variance among all variational 
dropout estimators for all conditions, although it is still substantially higher compared to not hav¬ 
ing any dropout regularization. The 1/M variance scaling achieved by our estimator is especially 
important early on in the optimization when it makes the largest difference (compare weight sample 
per minibatcli and weight sample per data point). The additional variance reduction obtained by our 
estimator through drawing fewer random numbers (section [273] ) is about a factor of 2, and this re¬ 
mains relatively stable as training progresses (compare local reparameterization and weight sample 
per data point). 


top layer 

stochastic gradient estimator 10 epochs 

local reparameterization (ours) 7.8 x 10 3 

weight sample per data point (slow) 1.4 x 10 4 

weight sample per minibatch (standard) 4.9 x 10 4 

no dropout noise (minimal var.) 2.8 x 10 3 


top layer 
100 epochs 

1.2 x 10 3 

2.6 x 10 3 

4.3 x 10 3 
5.9 x 10 1 


bottom layer bottom layer 
10 epochs 100 epochs 

1.9 x 10- 1.1 x 10 2 

4.3 x 10 2 2.5 x 10 2 

8.5 x 10 2 3.3 x 10 2 

1.3 x 10 2 9.0 x 10° 


Table 1: Average empirical variance of minibatch stochastic gradient estimates (1000 examples) for 
a fully connected neural network, regularized by variational dropout with independent weight noise. 


Speed. We compared the regular SGVB estimator, with separate weight samples per datapoint 
with the efficient estimator based on local reparameterization, in terms of wall-clock time efficiency. 
With our implementation on a modern GPU, optimization with the naive estimator took 1635 sec¬ 
onds per epoch, while the efficient estimator took 7.4 seconds: an over 200 fold speedup. 

Classification error. Figure[l]shows test-set classification error for the tested regularization meth¬ 
ods, for various choices of number of hidden units. Our adaptive variational versions of Gaussian 
dropout perform equal or better than their non-adaptive counterparts and standard dropout under all 
tested conditions. The difference is especially noticable for the smaller networks. In these smaller 
networks, we observe that variational dropout infers dropout rates that are on average far lower than 
the dropout rates for larger networks. This adaptivity comes at negligable computational cost. 
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Figure 1: Best viewed in color, (a) Comparison of various dropout methods, when applied to fully- 
connected neural networks for classification on the MNIST dataset. Shown is the classification 
error of networks with 3 hidden layers, averaged over 5 runs, he variational versions of Gaussian 
dropout perform equal or better than their non-adaptive counterparts; the difference is especially 
large with smaller models, where regular dropout often results in severe underfitting, (b) Compar¬ 
ison of dropout methods when applied to convolutional net a trained on the CIFAR-10 dataset, for 
different settings of network size k. The network has two convolutional layers with each 32 k and 
64A: feature maps, respectively, each with stride 2 and followed by a softplus nonlinearity. This is 
followed by two fully connected layers with each 128/.; hidden units. 


We found that slightly downscaling the KL divergence part of the variational objective can be ben¬ 
eficial. Variational (A2) in figure |T| denotes performance of type A variational dropout but with a 
KL-divergence downscaled with a factor of 3; this small modification seems to prevent underfitting, 
and beats all other dropout methods in the tested models. 

6 Conclusion 

Efficiency of posterior inference using stochastic gradient-based variational Bayes (SGVB) can often 
be significantly improved through a local reparameterization where global parameter uncertainty is 
translated into local uncertainty per datapoint. By injecting noise locally, instead of globally at the 
model parameters, we obtain an efficient estimator that has low computational complexity, can be 
trivially parallelized and has low variance. We show how dropout is a special case of SGVB with 
local reparameterization, and suggest variational dropout , a straightforward extension of regular 
dropout where optimal dropout rates are inferred from the data, rather than fixed in advance. We 
report encouraging empirical results. 

Acknowledgments 

We thank the reviewers and Yarin Gal for valuable feedback. Diederik Kingma is supported by the 
Google European Fellowship in Deep Learning, Max Welling is supported by research grants from 
Google and Facebook, and the NWO project in Natural AI (NAI.14.108). 


References 

[1] Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient 
Fisher scoring. arXiv preprint arXiv:1206.6380. 

[2] Ba, J. and Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural 
Information Processing Systems, pages 3084-3092. 

[3] Bayer, J., Karol. M„ Korhammer, D., and Van der Smagt, P. (2015). Fast adaptive weight noise. arXiv 
preprint arXiv:1507.05331. 

[4] Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons. arXiv preprint 
arXiv:1305.2982. 


8 












[5] Bergstra, J., Breuleux, O.. Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, 
D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the 
Python for Scientific Computing Conference (SciPy), volume 4. 

[6] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural net¬ 
works. arXiv preprint arXiv:1505.05424. 

[7] Gal, Y. and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty 
in deep learning. arXiv preprint arXiv:1506.02142. 

[8] Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information 
Processing Systems, pages 2348-2356. 

[9] Hernandez-Lobato, J. M. and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of 
Bayesian neural networks. arXiv preprint arXiv:1502.05336. 

[10] Hinton, G. E., Srivastava. N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving 
neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. 

[11] Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the descrip¬ 
tion length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, 
pages 5-13. ACM. 

[12] Kingma, D. and Ba, I. (2015). Adam: A method for stochastic optimization. Proceedings of the Interna¬ 
tional Conference on Learning Representations 2015. 

[13] Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable models in auxiliary 
form. arXiv preprint arXiv:1306.0733. 

[14] Kingma, D. P. and Welling. M. (2014). Auto-encoding variational Bayes. Proceedings of the 2nd Inter¬ 
national Conference on Learning Representations. 

[15] Maeda, S.-i. (2014). A Bayesian encourages dropout. arXiv preprint arXiv:1412.7003. 

[16] Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto. 

[17] Rezende, D. I., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate 
inference in deep generative models. In Proceedings of the 31st International Conference on Machine 
Learning (ICML-14), pages 1278-1286. 

[18] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical 
Statistics, 22(3):400-407. 

[19] Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochas¬ 
tic linear regression. Bayesian Analysis, 8(4). 

[20] Srivastava. N., Hinton, G., Krizhevsky, A., Sutskever. I., and Salakhutdinov, R. (2014). Dropout: A simple 
way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929— 
1958. 

[21] Wan, L., Zeiler, M., Zhang, S., Cun, Y. L„ and Fergus, R. (2013). Regularization of neural networks using 
dropconnect. In Proceedings of the 30th International Conference on Machine Learning (1CML-13), pages 
1058-1066. 

[22] Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International 
Conference on Machine Learning (ICML-13), pages 118-126. 

[23] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In 
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688. 


9 



A Floating-point numbers and compression 


The floating-point format is by far the most common number format used for model parameters in neural 
network type models. The absolute value of floating-point numbers in base 2 equals sign ■ (s/ 2 P-1 ) • 2 e where 
s is the mantissa with p bits and e is the exponent, both non-negative integer-valued. For random bit patterns for 
s and e, floating point numbers closely follow a log-uniform distribution with finite range. The floating-point 
format is thus close to optimal, from a compression point of view, for communicating parameters for which the 
receiver has a log-uniform prior distribution, with finite range, over the magnitude of parameters. Specifically, 
this prior for each individual floating-point parameter Wj is p(log | Wj \) = ZT(interval), where the interval is 
approximately (log(2 -8 ), log(2 8 )) for single-point precision and (log(2 -11 ), log(2 11 )) for double-precision 
floating-point numbers in the most common IEEE specification. 

The KL-divergence between a variational posterior q<f, (eq. 0 ) and this log-uniform prior is relatively simple: 
it equals the entropy under q$ of the log-transformed magnitude of w plus a constant: 

-£ , kl (?</.( w )|| p ( w )) = H{q^{\og |w|)) + constant (13) 

This divergence has a natural interpretation as controlling the number of significant digits under q,f, of the 
corresponding floating-point number. 


B Derivation of dropout’s implicit variational posterior 


In the common version of dropout 1201 . linear operations b’ = a* W in a model, where a ! is a column vector 
that is the result of previous operations on the input for datapoint i, are replaced by a stochastic operation: 


(a l 0(d7(l-p))W 

(14) 

E WjkOjdj/il-p) 

j 

(15) 

Bernoulli(l — p) 

(16) 

dropout rate (hyper-parameter, e.g. 0.5) 

(17) 


where column vector a 1 is the result of previous operations on the input for datapoint i. Expected values and 
variance w.r.t. the dropout rate are: 


E[dj] =(1 -p) 

E [4/(1 -a?)] = 1 

Var \d)/(l - p)] = Var [dj] /(I - p) 2 = p/(l - p) 


The expected value and variance for the elements of the resulting vector b 1 are: 


E 


M 


= E 


J2w jk a)d)/(l-p) 

3 

Y^Wiha) E [d}/(l-p)] 


E^ 




Var 


M 


= Var 


E WjkOjdj/il - p) 

, 3 

E Var [Wjfco}d}/(i-p)] 

3 

E W ffc(4) 2var [47(! -p)] 

3 

p/(! -p)E w lk{a)) 2 


(18) 

(19) 

( 20 ) 
( 21 ) 


( 22 ) 

(23) 

(24) 

(25) 

(26) 

(27) 

(28) 
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B.l Gaussian dropout 

Thus, ignoring dependencies between elements of b\ dropout can be approximated by a Gaussian HD: 

^(b'la*, W) = A/”(a*VK,p(l — p) _1 (a i ) 2 (W) 2 ) (29) 

i.e.: b* = a l W + t © yjp( 1 — p) _1 (a‘) 2 (W) 2 (30) 

where: e~A/’(0,1) (31) 

where (.) 2 , and © are again component-wise operations. 

B.2 Weight uncertainty 

Equivalently, the Gaussian dropout relationship p(b'|a s , W) can be parameterized as weight uncertainty: 


b i = a ! V (32) 

p(Vjk\Wjk,ajk) = N{Wjk, ctjkWjk) (33) 

otjk =p/(l-p) (34) 

=> p(h l \&\W) =Af(a l W,p(l-p)- 1 {a i ) 2 (W) 2 ) (35) 


In the main text, we show how we can treat the distribution over weights p(Vjk\Wjk■ ctjk) as a variational 
posterior q which we can optimize w.r.t. each ctjk (the dropout rate for individual parameters) to optimize a 
bound on the marginal likelihood of the data. 

C Negative KL-divergence for the log-uniform prior 

The variational lower bound 0 that we wish to optimize is given by 

C{4>) = -D KL (q^(-w)\\p(w)) + Lt,(< p), 

where is the expected log likelihood term which we approximate using Monte Carlo, as dis¬ 

cussed in section [273] Here we discuss how to deterministically compute the negative KL divergence term 
—-Dif_L(g</,(w)||p(w)) for the log-uniform prior proposed in section]®] For simplicity we only discuss the case 
of factorizing approximate posterior w) = although the extension to more involved posterior 

approximations is straight forward. 

Our log-uniform prior p(wi) (appendix | a) consists of a Bernoulli distribution on the sign bit Si G { — 1, +1} 
which captures the sign of Wi, combinedwith a uniform prior on the log-scale of Wi\ 

Wi = s. Inti | 

p(si) = Bernoulli(0.5) on { — 1, +1} 
p(log(|wi|)) oc c (36) 

For an approximate posterior q^iw,) with support on the real line, the KL divergence will thus consist of two 
parts: the KL-divergence between the posterior and prior on the sign bit Si and the KL-divergence between the 
conditional posterior and prior on the log scale log(|u>i|). 

The negative KL divergence between the posterior and prior on the sign bit is given by: 

-D K L[q</,{si)\p(si)\ = log(0.5) - Q^O) log[Q 0 (O)] - [1 - Q40)] log[l - <240)], (37) 

with Q40) the posterior CDF of Wi evaluated at 0, i.e. the probability of drawing a negative weight from our 
approximate posterior. 

The negative KL divergence of the second part (the log-scale) is then 

-L>Kife(log(|wi|)|si)b(log(|mi|))] = log(c) - <240)E 9 4,( Wi K<0) [ lo g(^(wti)/<2<^(0)) + log(|«4)] 

- (1 - <3^>(0))E^ (ro .| ro . >0) [log(q4wi)/(l - <240))) + log(|wi|)] , 

where q,/,(wi)/Q^(0) is the truncation of our approximate posterior to the negative part of the real-line, and 
the log(|«>, |) term enters through transforming the uniform prior from the log-space to the normal space. 

Putting both parts together the terms involving the CDF Q40) cancel, and we get: 

-D K L[q^(wi)\p{wi)] = log(c) + log(0.5) +'H(q^(w i )) - E, log(|wi|), (38) 

where T-L(q^^Wi)) is the entropy of our approximate posterior. 
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We now consider the special case of a dropout posterior , as proposed in section [3] which we defined as any 
approximate posterior distribution q&(wi) for which the weight noise is multiplicative: 

Wi = Om, with e; ~ q a (ti), (39) 

where we have divided the parameters <f>i = ( 9i,a ) into parameters governing the mean of q^,(wi ) (assuming 
EqCi = 1) and the multiplicative noise. Note that certain families of distributions, such as Gaussians, can 
always be reparameterized in this way. 

Assuming q a {ei) is a continuous distribution, this choice of posterior gives us 

U^iwi)) = H(g a (ei)) + log(| 6 >i|), (40) 

where the last term can be understood as the log jacobian determinant of a linear transformation of the random 
variable ei. 

Furthermore, we can rewrite the —E 9 log(|w;|) term of ((38} as 

—Eq log(|wi|) = -E s log(|0;ej|) = — log(|#i|) — E 9 log(|e;|). (41) 

Taking < ]40| ) and ( |4 1 ^ together, the terms depending on 9 thus cancel exactly, proving that the KL divergence 
between posterior and prior is indeed independent of these parameters: 

-D K L[q^{wi)\p(wi)\ = log(e) + log(0.5) + Hiq^iwi)) - E 9 log(|wi|) 

= log(e) + log(0.5) + H(q a {ei)) - E qa log(| ei |). (42) 

In fact, the log-uniform prior is the only prior consistent with dropout. If the prior had any additional terms 
involving Wi, their expectation w.r.t. q ^ would not cancel with the entropy term in —Dkl [q</>(wi)\p(wi)\. 

For the specific case of Gaussian dropout , as proposed in section[3] we have q a (ei) = N(l, a). This then 
gives us 


-D K L[q^(wi)\p(wi)] = log(c) + log(0.5) + 0.5(1 + log(27r) + log(a)) - E 9ct log(|ei|). 

The final term E qa log(| |) in this expression is not analytically tractable, but it can be pre-computed numer¬ 
ically for all values of a in the relevant range and then approximated using a 3rd degree polynomial in a. As 
shown in figure| 2 ] this approximation is extremely close. 

For Gaussian dropout this then gives us: 

—D K L[qd>(wi)\p(wi)\ « constant + 0.51og(a) + cia + cv.cx + 03 a 3 , 

with 

ci = 1.16145124, c 2 = -1.50204118, c 3 = 0.58629921. 

Alternatively, we can make use of the fact that — E 9c( log(|ei|) > 0 for all values of a, and use the following 
lower bound: 

— DKh[q<i>(wi)\p{wi)\ > constant- 1 -0.5log(a). 



Figure 2: Exact and approximate negative KL 
divergence —D KL \p(wi)] for the log- 

uniform prior and Gaussian dropout approximate 
posterior. Since we use an improper prior, the con¬ 
stant is arbitrary; we choose it so that the exact 
KL divergence is 0 at a = 1, which is the max¬ 
imum value we allow during optimization of the 
lower bound. The exact and approximate results 
are indistinguishable on the domain shown, which 
corresponds to dropout rates between p = 0.05 
and p = 0.5. For log(a) -A —oo the approxi¬ 
mated term vanishes, so that both our polynomial 
approximation and the lower bound become exact. 
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D Variance reduction of local parameterization compared to separate 
weight samples 


In section[23]we claim that our local reparameterization not only is more computationally efficient than drawing 
separate weight matrices for each training example, but that it also leads to an estimator that has lower variance. 
To see why, consider the stochastic gradient estimate with respect to the posterior parameter afj for a minibatch 
of size M = 1. Drawing random weights W to form the estimate, we get 


8L% 


da h 


dbr, 


2 (Ti, 


(43) 


If, on the other hand, we form the same gradient using the local reparameterization trick, we get 

^rSUVB A 2 

UJ-j -p S 


QL SGVB arSGVB a ^ 




db„ 


(44) 


We can decompose the variance of these estimators as 


Var, 




d °h 


= Var i)„ 


E, 


90, 


r SGVB 
J T> 

d °li 


I bm,j 


+ Efo 


Var, 


90, 


da h 


\bm,j 


Here, the first term is identical for both gradient estimators, but the second term is not. When drawing separate 
weight matrices per training example, it is 


Et, 


- 90 , 


8L S ™ 


“hEb 


da h 


- 90 , 


\brn.j 


dL\ 


db„ 


= 

I bm,j 


Var 0 


db 

Uum,j 






(45) 


with q — {bm,j 7 m,j) flm.i/4- 

When using the local reparameterization trick, we simply have 


Eh 


Var, 


90 , 


dL s P GVB 

da h 


| br, 


= E^ 


Var, 


90 , 


dbr, 


(46) 


In this case, the second term vanishes because the random variable C, m ,j is uniquely determined given b mj ', 
as there is only a single for each bmj. Because there are many e,., for each 6 m ,j, this variable is not 
uniquely determined by conditioning on b m j, so there is an additional (positive) contribution to the variance. 


E Variance of stochastic gradients for variational dropout with correlated 
weight noise 

In section [5] we empirically compare the variance of the different available stochastic estimators of the gra¬ 
dient of our variational objective for a model regularized using variational dropout with independent weight 
noise (sectiorJTTJ. Here we present the same analysis for variational dropout with correlated weight noise 
(sectio r|3.2[ l. As table[2]shows, the relative performance of the different estimators is similar to that reported in 
section pjal though the noise level is generally much higher using this regularization method. 

top layer top layer bottom layer bottom layer 
stochastic gradient estimator 10 epochs 100 epochs 10 epochs 100 epochs 

local reparameterization (ours) 2.9 x 10 4 5.8 x 10 a 8.8 x 10 2 5.3 x 10^ 

weight sample per minibatch (standard) 3.1 x 10 5 2.0 x 10 4 5.0 x 10 3 1.1 x 10 3 

no dropout noise (minimal var.) 7.1 x 10 3 9.8 x 10 2 4.3 x 10 2 1.2 x 10 2 

Table 2: Average empirical variance of minibatch stochastic gradient estimates (1000 examples) for 
a fully connected neural network, regularized by variational dropout with correlated weight noise. 
For this specification, the variance of the stochastic gradient estimator with separate weight samples 
for each data point is identical to that using the local reparameterization trick: Ignoring optimization 
of dropout rates, both methods correspond to the standard dropout training procedure. 
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F Variance of SGVB estimator with minibatches of datapoints without 
replacement 

Using the indicator variables .s, £ S, to denote whether the i-th training example is included in our minibatch, 
and Li(e, <j >) as shorthand for logp(y*|x l , w = /(e, <j>)), the Monte Carlo estimator 0 may be rewritten as 

N 

= (47 ) 


The variance of which is given by 

Var EjS [l“ VB (</>)] Vars [ Si l ^^2 + Var£ 0)] E » [ s ?] 

i= 1 
o/V 2 

+ -M 2 E Es M E * M Cov e [Li (e, 0), L 3 - (e, $] 

i>j 

+2^ c 0Ve[L i (e^),L j (e»], (48) 

\«=1 / i>j 

where the inequalities arise through ignoring the negative correlations between the Si, and where we use that 

E s [si] = Eg [s?] = M/N, and Vars [s;] < M/N, for N > 2 M. 

As can be seen, the variance contributed by the random selection of data points from V is inversely proportional 
to the minibatch size M. However, the variance contributed by our random sampling of e does not decrease with 
M, as we are using a single random sample for the entire minibatch and Cov f [Li(e, <j>), Lj(e, </>)] is positive 
on averag^j In practice, this means that the variance of L^/ WB (<j>) can be dominated by e for even moderately 
large M. 


2 We have E x i jy ; [Li(e, </>)] = E xJ - j [Lj(e, (/))] if the examples in the training data are i.i.d., which means 


that E.. 


,y l ,x-7 ,y3 


[Cov e [Li(e,<f>),Lj(e,(j))]] = Var E [E x i iyi [Li(e, <$>)]} > 0. 
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